TCGA Annotations

The goal of this notebook is to introduce you to the TCGA Annotations BigQuery table. You can find more detail about Annotations on the TCGA Wiki, but the key things to know are:

an annotation can refer to any "type" of TCGA "item" (e.g. patient, sample, portion, slide, analyte or aliquot), and
each annotation has a "classification" and a "category", both of which are drawn from controlled vocabularies.

The current set of annotation classifications includes: Redaction, Notification, CenterNotification, and Observation. The authority for Redactions and Notifications is the BCR (Biospecimen Core Resource), while CenterNotifications can come from any of the data-generating centers (GSC or GCC), and Observations from any authorized TCGA personnel. Within each classification type, there are several categories.

We will look at these further by querying directly on the Annotations table.

Note that annotations about patients, samples, and aliquots are separate from the clinical, biospecimen, and molecular data, and most patients, samples, and aliquots do not in fact have any annotations associated with them. It can be important, however, when creating a cohort or analyzing the molecular data associated with a cohort, to check for the existence of annotations.

As usual, in order to work with BigQuery, you need to import the Python BigQuery module (gcp.bigquery) and you need to know the name(s) of the table(s) you are going to be working with:



In [1]:

    
import gcp.bigquery as bq
# Reference to the TCGA Annotations table, written as project:dataset.table
annotations_BQtable = bq.Table('isb-cgc:tcga_201607_beta.Annotations')

Schema

Let's start by looking at the schema to see what information is available from this table:



In [2]:

    
%bigquery schema --table $annotations_BQtable









    Out[2]:

Item Types

Most of the schema fields come directly from the TCGA Annotations. First and foremost, an annotation is associated with an itemType, as described above. This can be a patient, an aliquot, etc. Let's see what the breakdown is of annotations according to item-type:



In [3]:

    
%%sql

SELECT itemTypeName, COUNT(*) AS n
FROM $annotations_BQtable
GROUP BY itemTypeName
ORDER BY n DESC









    Out[3]:





itemTypeName     n
Shipped Portion  1749
Aliquot          1729
Patient          1380
Analyte          789
Slide            552
Sample           114
Portion          9
    
(rows: 7, time: 1.0s,    69KB processed, job: job_cRN_-AkWjppd0jy_Ud4lAb4AFQo)

The length of the barcode in the itemBarcode field depends on the value in the itemTypeName field: if the itemType is "Patient", then the barcode will be something like TCGA-E2-A15J, whereas if the itemType is "Aliquot", the barcode will be a full-length barcode, e.g. TCGA-E2-A15J-10A-01D-A12N-01.
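As a quick sanity check, the relationship between itemTypeName and barcode length can be sketched in plain Python. This is only an illustration: the helper name barcode_length_matches and the lookup table are our own assumptions, and the lengths are simply read off example barcodes that appear in this notebook. Item types without an example barcode here (such as Portion and Shipped Portion) are left unchecked rather than guessed.

```python
# Barcode lengths read off example barcodes shown in this notebook.
# Item types with no example here are deliberately omitted.
EXPECTED_BARCODE_LENGTH = {
    "Patient": 12,   # TCGA-E2-A15J
    "Sample": 16,    # TCGA-61-1916-01A
    "Analyte": 20,   # TCGA-61-1916-01A-01R
    "Slide": 24,     # TCGA-61-1916-01A-21-1559
    "Aliquot": 28,   # TCGA-E2-A15J-10A-01D-A12N-01
}


def barcode_length_matches(item_type_name, item_barcode):
    """Return True or False when the length can be checked, or None when
    the item type is not listed in EXPECTED_BARCODE_LENGTH."""
    expected = EXPECTED_BARCODE_LENGTH.get(item_type_name)
    if expected is None:
        return None
    return len(item_barcode) == expected


patient_ok = barcode_length_matches("Patient", "TCGA-E2-A15J")
aliquot_ok = barcode_length_matches("Aliquot", "TCGA-E2-A15J-10A-01D-A12N-01")
mismatch = barcode_length_matches("Aliquot", "TCGA-E2-A15J")
unknown = barcode_length_matches("Shipped Portion", "TCGA-E2-A15J")
```

A False result would suggest that itemTypeName and itemBarcode disagree for that row, which is worth a closer look before relying on the annotation.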

Annotation Classifications and Categories

The next most important pieces of information about an annotation are the "classification" and "category". Each of these comes from a controlled vocabulary and each "classification" has a specific set of allowed "categories".

One important thing to understand is that if an aliquot carries some sort of disqualifying annotation, in general all other data from other samples or aliquots associated with that same patient should still be usable. On the other hand, if a patient carries some sort of disqualifying annotation, then that information should be considered prior to using any of the samples or aliquots derived from that patient.

To illustrate this, let's look at the most frequent annotation classifications and categories when the itemType is Patient:



In [4]:

    
%%sql

SELECT
  annotationClassification,
  annotationCategoryName,
  COUNT(*) AS n
FROM
  $annotations_BQtable
WHERE
  ( itemTypeName="Patient" )
GROUP BY
  annotationClassification,
  annotationCategoryName
HAVING ( n >= 50 )
ORDER BY
  n DESC









    Out[4]:





annotationClassification  annotationCategoryName                                                       n
Notification              Prior malignancy                                                             407
Notification              Alternate sample pipeline                                                    200
Notification              History of unacceptable prior treatment related to a prior/other malignancy  139
Notification              Synchronous malignancy                                                       110
Notification              Neoadjuvant therapy                                                          102
Notification              Item is noncanonical                                                         81
    
(rows: 6, time: 2.0s,   308KB processed, job: job_nZ2y7s_1hNEd_FPiRPYSwoVPyWc)

The results of the previous query indicate that the majority of patient-level annotations are "Notifications", most frequently regarding prior malignancies. In most TCGA publications, the "History of unacceptable prior treatment related to a prior/other malignancy" and "Item is noncanonical" notifications are treated as disqualifying annotations, and data from those patients are excluded from all analyses.
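Here is a minimal sketch of how such an exclusion might be applied to a list of patient barcodes once the patient-level annotations have been fetched. The function name split_cohort and the barcodes of the form TCGA-XX-000n are made up for illustration, and which categories count as disqualifying is ultimately a decision for each analysis; the set below only mirrors the two categories named above.

```python
# Categories treated as disqualifying in this sketch (an assumption that
# mirrors the two categories mentioned in the text; adjust per analysis).
DISQUALIFYING_CATEGORIES = {
    "History of unacceptable prior treatment related to a prior/other malignancy",
    "Item is noncanonical",
}


def split_cohort(cohort, annotation_rows, disqualifying=DISQUALIFYING_CATEGORIES):
    """Split a list of participant barcodes into (kept, excluded).

    Only patient-level rows are considered, and input order is preserved.
    """
    flagged = {
        row["itemBarcode"]
        for row in annotation_rows
        if row["itemTypeName"] == "Patient"
        and row["annotationCategoryName"] in disqualifying
    }
    kept = [p for p in cohort if p not in flagged]
    excluded = [p for p in cohort if p in flagged]
    return kept, excluded


# Hypothetical cohort and annotation rows, for illustration only.
cohort = ["TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"]
annotation_rows = [
    {"itemTypeName": "Patient", "itemBarcode": "TCGA-XX-0001",
     "annotationCategoryName": "Prior malignancy"},
    {"itemTypeName": "Patient", "itemBarcode": "TCGA-XX-0002",
     "annotationCategoryName": "Item is noncanonical"},
]
kept, excluded = split_cohort(cohort, annotation_rows)
```

In this toy example the patient with a "Prior malignancy" notification is kept, because that category is not in the disqualifying set used here.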

Let's make a slight modification to the last query to see what types of annotation categories and classifications we see when the item type is not patient:



In [5]:

    
%%sql

SELECT
  annotationClassification,
  annotationCategoryName,
  itemTypeName,
  COUNT(*) AS n
FROM
  $annotations_BQtable
WHERE
  ( itemTypeName!="Patient" )
GROUP BY
  annotationClassification,
  annotationCategoryName,
  itemTypeName
HAVING ( n >= 50 )
ORDER BY
  n DESC









    Out[5]:





annotationClassification  annotationCategoryName  itemTypeName     n
Notification              Item is noncanonical    Shipped Portion  1741
CenterNotification        Item flagged DNU        Aliquot          1057
Notification              Item is noncanonical    Slide            541
Notification              Item is noncanonical    Analyte          464
Observation               General                 Analyte          179
CenterNotification        Center QC failed        Aliquot          153
Observation               General                 Aliquot          116
Notification              Item in special subset  Analyte          104
Notification              Barcode incorrect       Aliquot          84
Redaction                 Genotype mismatch       Aliquot          80
Notification              Item is noncanonical    Sample           67
Redaction                 Inadvertently shipped   Aliquot          54
    
(rows: 12, time: 2.9s,   308KB processed, job: job_3Yv7RlfX-Pc9ft0pyOOQarPFD7U)

The results of the previous query indicate that the most frequent annotation on non-patient items is the Notification "Item is noncanonical", applied mostly to shipped portions, slides, and analytes. At the aliquot level, most annotations were submitted by one of the data-generating centers (CenterNotification) and indicate that the data derived from that aliquot is "DNU" (Do Not Use). In general, an aliquot-level annotation should not affect any other aliquots derived from the same sample or any other samples derived from the same patient.
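The scoping rule just described (an annotation concerns the annotated item and whatever is derived from it, but not its siblings) can be sketched in a few lines of Python. This sketch assumes that TCGA barcodes are hierarchical, so that a derived item's barcode starts with its parent's barcode followed by a dash, as the example barcodes in this notebook suggest. The function names are hypothetical, and the rows are a small subset of the annotations for patient TCGA-61-1916 that are shown later in this notebook.

```python
def annotation_applies_to(annotated_item_barcode, target_barcode):
    """True if target_barcode is the annotated item itself or an item
    derived from it (assumed: derived barcodes extend the parent's
    barcode with a dash-separated suffix)."""
    if target_barcode == annotated_item_barcode:
        return True
    return target_barcode.startswith(annotated_item_barcode + "-")


def relevant_annotations(target_barcode, annotations):
    """Return the annotations whose item is the target or one of its ancestors."""
    return [
        a for a in annotations
        if annotation_applies_to(a["itemBarcode"], target_barcode)
    ]


# Subset of rows for patient TCGA-61-1916, for illustration.
annotations = [
    {"itemTypeName": "Patient", "itemBarcode": "TCGA-61-1916",
     "annotationCategoryName": "Item in special subset"},
    {"itemTypeName": "Aliquot", "itemBarcode": "TCGA-61-1916-01A-01D-0803-01",
     "annotationCategoryName": "Item flagged DNU"},
]

# The DNU flag on one aliquot does not carry over to a different aliquot,
# but the patient-level notification is relevant to both.
other_aliquot = relevant_annotations("TCGA-61-1916-02A-01R-0808-01", annotations)
flagged_aliquot = relevant_annotations("TCGA-61-1916-01A-01D-0803-01", annotations)
```

Comparing against the barcode plus a trailing dash, rather than the bare prefix, avoids accidental matches between unrelated barcodes that merely share leading characters.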

We see in the output of the previous query that an "Item is noncanonical" Notification can be applied to different types of items (e.g. slides and analytes). Let's investigate this a little further by counting these annotations by study (i.e. tumor type):



In [6]:

    
%%sql

SELECT
  Study,
  COUNT(*) AS n
FROM
  $annotations_BQtable
WHERE
  ( annotationCategoryName="Item is noncanonical" )
GROUP BY
  Study
ORDER BY
  n DESC









    Out[6]:





Study  n
OV     743
GBM    519
KIRC   455
COAD   314
LUAD   238
LUSC   231
HNSC   212
READ   115
KICH   47
PRAD   27
CHOL   15
ACC    12
PAAD   8
BRCA   2
    
(rows: 14, time: 0.8s,   177KB processed, job: job_xMNKI2Ng_GENgTZzFbH0R_OAaUE)

Now let's pick one of these tumor types (OV) and delve a little further:



In [7]:

    
%%sql

SELECT
  itemTypeName,
  COUNT(*) AS n
FROM
  $annotations_BQtable
WHERE
  ( annotationCategoryName="Item is noncanonical"
    AND Study="OV" )
GROUP BY
  itemTypeName
ORDER BY
  n DESC









    Out[7]:





itemTypeName     n
Slide            409
Shipped Portion  220
Analyte          110
Patient          3
Sample           1
    
(rows: 5, time: 0.9s,   247KB processed, job: job_cALh5aAuWIlhHpmRkCpKDoN8u4c)

Barcodes

As described above, an annotation is specific to a single TCGA "item" and the fields itemTypeName and itemBarcode are the most important keys to understanding which TCGA item carries the annotation. Because we use the fields ParticipantBarcode, SampleBarcode, and AliquotBarcode throughout our other TCGA BigQuery tables, we have added them to this table as well, but they should be interpreted with some care: when an annotation is specific to an aliquot (ie itemTypeName="Aliquot"), the ParticipantBarcode, SampleBarcode, and AliquotBarcode fields will all be set, but this should not be interpreted to mean that the annotation applies to all data derived from that patient.

This will be illustrated with the following two queries which extract information pertaining to a few specific patients:



In [8]:

    
%%sql

SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  $annotations_BQtable
WHERE
  ( ParticipantBarcode="TCGA-61-1916" )
ORDER BY n ASC









    Out[8]:





Study  itemTypeName  itemBarcode                   annotationCategoryName  annotationClassification  ParticipantBarcode  SampleBarcode     AliquotBarcode                n
OV     Patient       TCGA-61-1916                  Item in special subset  Notification              TCGA-61-1916                                                        12
OV     Analyte       TCGA-61-1916-01A-01R          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Analyte       TCGA-61-1916-02A-01T          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-01A-01D          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Analyte       TCGA-61-1916-02A-01R          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-02A-01D          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-11A-01D          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-11A                                20
OV     Analyte       TCGA-61-1916-01A-01G          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Analyte       TCGA-61-1916-02A-01W          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-02A-01G          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-11A-01W          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-11A                                20
OV     Analyte       TCGA-61-1916-01A-01T          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Analyte       TCGA-61-1916-01A-01W          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Slide         TCGA-61-1916-01A-21-1559      Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                24
OV     Aliquot       TCGA-61-1916-02A-01R-0808-01  General                 Observation               TCGA-61-1916        TCGA-61-1916-02A  TCGA-61-1916-02A-01R-0808-01  28
OV     Aliquot       TCGA-61-1916-01A-01D-0803-01  Item flagged DNU        CenterNotification        TCGA-61-1916        TCGA-61-1916-01A  TCGA-61-1916-01A-01D-0803-01  28
    
(rows: 16, time: 0.8s,   727KB processed, job: job_Kms6IP7VWumzFlDh5X2qoeWxBBg)



In [9]:

    
%%sql

SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  $annotations_BQtable
WHERE
  ( ParticipantBarcode="TCGA-GN-A261" )
ORDER BY n ASC









    Out[9]:





Study  itemTypeName  itemBarcode   annotationCategoryName         annotationClassification  ParticipantBarcode  SampleBarcode  AliquotBarcode  n
SKCM   Patient       TCGA-GN-A261  Tumor tissue origin incorrect  Redaction                 TCGA-GN-A261                                       12
SKCM   Patient       TCGA-GN-A261  Neoadjuvant therapy            Notification              TCGA-GN-A261                                       12
    
(rows: 2, time: 1.0s,   727KB processed, job: job_jGPJrgpn_baosYFarH9fRQp-ct4)

As you can see in the results returned from the previous two queries, the SampleBarcode and the AliquotBarcode fields may or may not be filled in, depending on the itemTypeName.
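The pattern visible in those results can be summarised in a short Python sketch. Note that this is inferred from the query output above and is not a documented description of how the table was built: the function name derived_barcodes and the slicing rule (first 12 characters for the participant, first 16 for the sample) are assumptions that happen to reproduce the rows shown. None stands for an empty field.

```python
def derived_barcodes(item_type_name, item_barcode):
    """Derive the three barcode fields from an annotated item.

    Assumed rule, read off the query results above:
      - ParticipantBarcode is the first 12 characters,
      - SampleBarcode is the first 16 characters, when the item is at
        sample level or below,
      - AliquotBarcode is set only when the item itself is an aliquot.
    """
    participant = item_barcode[:12]
    sample = item_barcode[:16] if len(item_barcode) >= 16 else None
    aliquot = item_barcode if item_type_name == "Aliquot" else None
    return {
        "ParticipantBarcode": participant,
        "SampleBarcode": sample,
        "AliquotBarcode": aliquot,
    }


# Rows taken from the TCGA-61-1916 results above.
patient_fields = derived_barcodes("Patient", "TCGA-61-1916")
analyte_fields = derived_barcodes("Analyte", "TCGA-61-1916-01A-01R")
aliquot_fields = derived_barcodes("Aliquot", "TCGA-61-1916-01A-01D-0803-01")
```

The point to take away is that a filled-in ParticipantBarcode only tells you which patient the item came from; itemTypeName and itemBarcode still determine what the annotation is actually about.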



In [10]:

    
%%sql

SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 annotationNoteText,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  $annotations_BQtable
WHERE
  ( ParticipantBarcode="TCGA-RS-A6TP" )
ORDER BY n ASC









    Out[10]:





Study:                     HNSC
itemTypeName:              Analyte
itemBarcode:               TCGA-RS-A6TP-10A-01D
annotationCategoryName:    General
annotationClassification:  Observation
annotationNoteText:        DNA analyte UUID: 8304F61F-C217-4B9F-BA64-6486DA54E6C8 was involved in an extraction protocol deviation wherein an additional column purification step was used as a means of buffer exchange on the column-eluted analyte.
ParticipantBarcode:        TCGA-RS-A6TP
SampleBarcode:             TCGA-RS-A6TP-10A
AliquotBarcode:            (empty)
n:                         20
    
(rows: 1, time: 1.1s,     1MB processed, job: job_X_hNrE1FA0XESLN6cFEhRUYznWQ)

In this example, there is just one annotation relevant to this particular patient, and one has to look at the annotationNoteText to find out what the potential issue may be with this particular analyte. Any aliquots derived from this blood-normal analyte might need to be used with care.
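Finally, coming back to the point made at the start about checking a cohort for annotations, here is a self-contained sketch of that check. To keep it runnable anywhere, it uses Python's built-in sqlite3 module as a stand-in for BigQuery, so the SQL dialect is SQLite's rather than BigQuery's. The Cohort table and the barcode TCGA-XX-0000 are hypothetical; the three annotation rows are copied from the query results in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Small stand-in for the Annotations table (only a few columns).
conn.execute(
    "CREATE TABLE Annotations ("
    " ParticipantBarcode TEXT, itemTypeName TEXT, itemBarcode TEXT,"
    " annotationClassification TEXT, annotationCategoryName TEXT)"
)
conn.executemany(
    "INSERT INTO Annotations VALUES (?, ?, ?, ?, ?)",
    [
        ("TCGA-GN-A261", "Patient", "TCGA-GN-A261",
         "Redaction", "Tumor tissue origin incorrect"),
        ("TCGA-GN-A261", "Patient", "TCGA-GN-A261",
         "Notification", "Neoadjuvant therapy"),
        ("TCGA-RS-A6TP", "Analyte", "TCGA-RS-A6TP-10A-01D",
         "Observation", "General"),
    ],
)

# Hypothetical cohort; TCGA-XX-0000 is a made-up barcode with no annotations.
conn.execute("CREATE TABLE Cohort (ParticipantBarcode TEXT)")
conn.executemany(
    "INSERT INTO Cohort VALUES (?)",
    [("TCGA-GN-A261",), ("TCGA-RS-A6TP",), ("TCGA-XX-0000",)],
)

# LEFT JOIN keeps every cohort member, including those with no annotations;
# COUNT(a.itemBarcode) ignores the NULLs produced for unmatched members.
query = """
SELECT c.ParticipantBarcode, COUNT(a.itemBarcode) AS n_annotations
FROM Cohort AS c
LEFT JOIN Annotations AS a
  ON a.ParticipantBarcode = c.ParticipantBarcode
GROUP BY c.ParticipantBarcode
ORDER BY n_annotations DESC, c.ParticipantBarcode
"""
annotation_counts = conn.execute(query).fetchall()
conn.close()
```

Cohort members with a non-zero count are the ones whose annotations deserve a closer look, keeping in mind which item each annotation is attached to.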


